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Abstract 

The first step in most empirical work in 
multilingual NLP is to construct maps of 
the correspondence between texts and their 
translations (bitext maps). The Smooth 
Injective Map Recognizer (SIMR) algo- 
rithm presented here is a generic pattern 
recognition algorithm that is particularly 
well-suited to mapping bitext correspon- 
dence. SIMR is faster and significantly 
more accurate than other algorithms in the 
literature. The algorithm is robust enough 
to use on noisy texts, such as those result- 
ing from OCR input, and on translations 
that are not very literal. SIMR encap- 
sulates its language-specific heuristics, so 
that it can be ported to any language pair 
with a minimal effort. 



1 Introduction 

Texts that are available in two languages (bitexts) 
are immensely valuable for many natural language 
processing applications 1 . Bitexts are the raw ma- 
terial from which translation models are built. In 
addition to their use in machine translation (Sato 
& Nagao, 1990; Brown et a!., 1993; Melamed, 
1997), translation models can be applied to machine- 
assisted translation (Sato, 1992; Foster et al., 1996), 
cross-lingual information retrieval (SIGIR, 1996), 
and gisting of World Wide Web pages (Resnik, 
1997). Bitexts also play a role in less auto- 
mated applications such as concordancing for bilin- 
gual lexicography (Catizone et al., 1993; Gale & 
Church, 1991b), computer-assisted language learn- 
ing, and tools for translators (e.g. (Macklovitch, 

1 "Multitexts" in more than two languages are even 
more valuable, but they are much more rare. 



1995; Melamed, 1996b). However, bitexts are of lit- 
tle use without an automatic method for construct- 
ing bitext maps. 

Bitext maps identify corresponding text units be- 
tween the two halves of a bitext. The ideal bitext 
mapping algorithm should be fast and accurate, use 
little memory and degrade gracefully when faced 
with translation irregularities like omissions and in- 
versions. It should be applicable to any text genre 
in any pair of languages. 

The Smooth Injective Map Recognizer (SIMR) al- 
gorithm presented in this paper is a bitext mapping 
algorithm that advances the state of the art on these 
criteria. The evaluation in Section 5 shows that 
SIMR's error rates are lower than those of other 
bitext mapping algorithms by an order of magni- 
tude. At the same time, its expected running time 
and memory requirements are linear in the size of the 
input, better than any other published algorithm. 

The paper begins by laying down SIMR's geomet- 
ric foundations and describing the algorithm. Then, 
Section 4 explains how to port SIMR to arbitrary 
language pairs with minimal effort, without rely- 
ing on genre-specific information such as sentence 
boundaries. The last section offers some insights 
about the optimal level of text analysis for mapping 
bitext correspondence. 

2 Bitext Geometry 

A bitext (Harris, 1988) comprises two versions of 
a text, such as a text in two different languages. 
Translators create a bitext each time they trans- 
late a text. Each bitext defines a rectangular 
bitext space, as illustrated in Figure 1. The width 
and height of the rectangle are the lengths of the 
two component texts, in characters. The lower left 
corner of the rectangle is the origin of the bitext 
space and represents the two texts' beginnings. The 
upper right corner is the terminus and represents 
the texts' ends. The line between the origin and the 
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origin x = character position in text 1 



Figure 1: a bitext space 

terminus is the main diagonal. The slope of the 
main diagonal is the bitext slope. 

Each bitext space contains a number of true 
points of correspondence (TPCs), other than 
the origin and the terminus. For example, if a token 
at position p on the x-axis and a token at position 
q on the y-axis are translations of each other, then 
the coordinate (p, q) in the bitext space is a TPC 2 . 
TPCs also exist at corresponding boundaries of text 
units such as sentences, paragraphs, and chapters. 
Groups of TPCs with a roughly linear arrangement 
in the bitext space are called chains. 

Bitext maps are 1-to-l functions in bitext 
spaces. A complete set of TPCs for a particular 
bitext is called a true bitext map (TBM). The 
purpose of a bitext mapping algorithm is to pro- 
duce bitext maps that are the best possible approx- 
imations of each bitext's TBM. 

3 SIMR 

SIMR builds bitext maps one chain at a time. The 
search for each chain alternates between a genera- 
tion phase and a recognition phase. The genera- 
tion phase begins in a small rectangular region of 
the bitext space, whose diagonal is parallel to the 
main diagonal. Within this search rectangle, SIMR 
generates all the points of correspondence that sat- 
isfy the supplied matching predicate, as explained 
in Section 3.1. In the recognition phase, SIMR 
calls the chain recognition heuristic to find suitable 
chains among the generated points. If no suitable 
chains are found, the search rectangle is proportion- 
ally expanded and the generation-recognition cycle 

2 Since distances in the bitext space are measured in 
characters, the position of a token is defined as the mean 
position of its characters. 



is repeated. The rectangle keeps expanding until at 
least one acceptable chain is found. If more than 
one chain is found in the same cycle, SIMR accepts 
the one whose points are least dispersed around its 
least-squares line. Each time SIMR accepts a chain, 
it selects another region of the bitext space to search 
for the next chain. 

SIMR employs a simple heuristic to select regions 
of the bitext space to search. To a first approxima- 
tion, TBMs are monotonically increasing functions. 
This means that if SIMR finds one chain, it should 
look for others either above and to the right or below 
and to the left of the one it has just found. All SIMR 
needs is a place to start the trace. A good place to 
start is at the beginning: Since the origin of the 
bitext space is always a TPC, the first search rect- 
angle is anchored at the origin. Subsequent search 
rectangles are anchored at the top right corner of 
the previously found chain, as shown in Figure 2. 




Figure 2: SIMR's "expanding rectangle" search 
strategy. The search rectangle is anchored at the top 
right corner of the previously found chain. Its diag- 
onal remains parallel to the main diagonal. 

The expanding-rectangle search strategy makes 
SIMR robust in the face of TBM discontinuities. 
Figure 2 shows a segment of the TBM that contains 
a vertical gap (an omission in the text on the x-axis) . 
As the search rectangle grows, it will eventually in- 
tersect with the TBM, even if the discontinuity is 
quite large (Melamed, 1996b). The noise filter de- 
scribed in Section 3.3 prevents SIMR from being led 
astray by false points of correspondence. 

3.1 Point Generation 

SIMR generates candidate points of correspondence 
in the search rectangle using one of its matching 
predicates. A matching predicate is a heuristic 
for deciding whether a given pair of tokens are likely 
to be mutual translations. Two kinds of information 



that a matching predicate can rely on most often are 
cognates and translation lexicons. 

Two tokens in a bitext are cognates if they have 
the same meaning and similar spellings. In the non- 
technical Canadian Hansards (parliamentary debate 
transcripts available in English and in French), cog- 
nates can be found for roughly one quarter of all 
text tokens (Melamed, 1995). Even distantly related 
languages like English and Czech will share a large 
number of cognates in the form of proper nouns. 
Cognates are more common in bitexts from more 
similar language pairs, and from text genres where 
more word borrowing occurs, such as technical texts. 
When dealing with language pairs that have dissim- 
ilar alphabets, the matching predicate can employ 
phonetic cognates (Melamed, 1996a). When one 
or both of the languages involved is written in pic- 
tographs, cognates can still be found among punc- 
tuation and digit strings. However, cognates of this 
last kind arc usually too sparse to suffice by them- 
selves. 

When the matching predicate cannot generate 
enough candidate correspondence points based on 
cognates, its signal can be strengthened by a trans- 
lation lexicon. Translation lexicons can be ex- 
tracted from machine-readable bilingual dictionaries 
(MRBDs), in the rare cases where MRBDs are avail- 
able. In other cases, they can be constructed auto- 
matically or semi-automatically using any of several 
methods (Fung, 1995; Melamed, 1996c; Resnik & 
Melamed, 1997). Since the matching predicate need 
not be perfectly accurate, the translation lexicons 
need not be either. 

Matching predicates can take advantage of other 
information, besides cognates and translation lex- 
icons. For example, a list of faux amis is a use- 
ful complement to a cognate matching strategy 
(Macklovitch, 1995). A stop list of function words is 
also helpful. Function words are translated inconsis- 
tently and make unreliable points of correspondence 
(Melamed, 1996a). 

3.2 Point Selection 

As illustrated in Figure 2, even short sequences of 
TPCs form characteristic patterns. Most chains of 
TPCs have the following properties: 

• Linearity: TPCs tend to line up straight. 

• Low Variance of Slope: The slope of a TPC 
chain is rarely much different from the bitext 
slope. 

• Injectivity: No two points in a chain of TPCs 
can have the same x- or y-co-ordinates. 



SIMR's chain recognition heuristic exploits these 
properties to decide which chains in the search rect- 
angle might be TPC chains. 

The heuristic involves three parameters: chain 
size, maximum point dispersal and maximum 
angle deviation. A chain's size is simply the num- 
ber of points it contains. The heuristic considers 
only chains of exactly the specified size whose points 
are injective. The linearity of the these chains is 
tested by measuring the root mean squared distance 
of the chain's points from the chain's least-squares 
line. If this distance exceeds the maximum point 
dispersal threshold, the chain is rejected. Next, the 
angle of each chain's least-squares line is compared 
to the arctangent of the bitext slope. If the differ- 
ence exceeds the maximum angle deviation thresh- 
old, the chain is rejected. These filters can be effi- 
ciently combined so that SIMR's expected running 
time and memory requirements are linear in the size 
of the input bitext (Melamed, 1996a). 

The chain recognition heuristic pays no attention 
to whether chains are monotonic. Non-monotonic 
TPC chains are quite common, because even lan- 
guages with similar syntax like French and English 
have well-known differences in word order. For ex- 
ample, English (adjective, noun) pairs usually corre- 
spond to French (noun, adjective) pairs. Such inver- 
sions result in TPCs arranged like the middle two 
points in the "previous chain" of Figure 2. SIMR 
has no problem accepting the inverted points. 

If the order of words in a certain text passage is 
radically altered during translation, SIMR will sim- 
ply ignore the words that "move too much" and con- 
struct chains out of those that remain more station- 
ary. The maximum point dispersal parameter lim- 
its the width of accepted chains, but nothing lim- 
its their length. In practice, the chain recognition 
heuristic often accepts chains that span several sen- 
tences. The ability to analyze non-monotonic points 
of correspondence over variable-size areas of bitext 
space makes SIMR robust enough to use on transla- 
tions that are not very literal. 

3.3 Noise Filter 

Points of correspondence among frequent token 
types often line up in rows and columns, as illus- 
trated in Figure 3. Token types like the English 
article "a" can produce one or more correspondence 
points for almost every sentence in the opposite text. 
Only one point of correspondence in each row and 
column can be correct; the rest are noise. A noise fil- 
ter can make it easier for SIMR to find TPC chains. 

Other bitext mapping algorithms mitigate this 
source of noise either by assigning lower weights to 
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Figure 3: Frequent tokens cause false points of cor- 
respondence that line up in rows and columns. 



correspondence points associated with frequent to- 
ken types (Church, 1993) or by deleting frequent to- 
ken types from the bitext altogether (Dagan et al., 
1993). However, a token type that is relatively fre- 
quent overall can be rare in some parts of the text. 
In those parts, the token type can provide valuable 
clues to correspondence. On the other hand, many 
tokens of a relatively rare type can be concentrated 
in a short segment of the text, resulting in many 
false correspondence points. The varying concentra- 
tion of identical tokens suggests that more localized 
noise filters would be more effective. SIMR's local- 
ized search strategy provides a vehicle for a localized 
noise filter. 

The filter is based on the maximum point am- 
biguity level parameter. For each point p = (x, y), 
let X be the number of points in column x within 
the search rectangle, and let Y be the number of 
points in row y within the search rectangle. Then 
the ambiguity level of p is X + Y - 2. In partic- 
ular, if p is the only point in its row and column, 
then its ambiguity level is zero. The chain recogni- 
tion heuristic ignores points whose ambiguity level is 
too high. What makes this a localized filter is that 
only points within the search rectangle count toward 
each other's ambiguity level. The ambiguity level of 
a given point can change when the search rectangle 
expands or moves. 

The noise filter ensures that false points of corre- 
spondence are very sparse, as illustrated in Figure 4. 
Even if one chain of false points of correspondence 
slips by the chain recognition heuristic, the expand- 
ing rectangle will find its way back to the TBM be- 
fore the chain recognition heuristic accepts another 
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Figure 4: SIMR's noise filter ensures that TPCs 
are much more dense than false points of correspon- 
dence. A good signal-to-noise ratio prevents SIMR 
from getting lost. 



chain. If the matching predicate generates a reason- 
ably strong signal then the signal-to-noise ratio will 
be high and SIMR will not get lost, even though it 
is a greedy algorithm with no ability to look ahead. 

4 Porting to New Language Pairs 

SIMR can be ported to a new language pair in three 
steps. 

4.1 Step 1: Construct Matching Predicate 

The original SIMR implementation for 
French/English included matching predicates that 
could use cognates and/or translation lexicons. For 
language pairs in which lexical cognates are frequent, 
a cognate-based matching predicate should suffice. 
In other cases, a "seed" translation lexicon may be 
used to boost the number of candidate points pro- 
duced in the generation phase of the search. The 
SIMR implementation for Spanish/English uses only 
cognates. For Korean/English, SIMR takes advan- 
tage of punctuation and number cognates but sup- 
plements them with a small translation lexicon. 

4.2 Step 2: Construct Axis Generators 

In order for SIMR to generate candidate points of 
correspondence, it needs to know what token pairs 
correspond to co-ordinates in the search rectangle. 
It is the axis generator's job to map the two halves 
of the bitext to positions on the x- and y-axes of 
the bitext space, before SIMR starts searching for 
chains. This mapping should be done with the 
matching predicate in mind. 

If the matching predicate uses cognates, then ev- 
ery word that might have a cognate in the other 
half of the bitext should be assigned its own axis 



position. This rule applies to punctuation and num- 
bers as well as to "lexical" cognates. In the case of 
lexical cognates, the axis generator typically needs 
to invoke a language-specific tokenization program 
to identify words in the text. Writing such a pro- 
gram may constitute a significant part of the port- 
ing effort, if no such program is available in advance. 
The effort may be lessened, however, by the realiza- 
tion that it is acceptable for the tokenization pro- 
gram to overgenerate just as it is acceptable for the 
matching predicate. For example, when tokcnizing 
German text, it is not necessary for the tokenizer 
to know which words are compounds. A word that 
has another word as a substring should result in one 
axis position for the substring and one for the su- 
perstring. 

When lexical cognates are not being used, the axis 
generator only needs to identify punctuation, num- 
bers, and those character strings in the text which 
also appear on the relevant side of the translation 
lexicon 3 . It would be pointless to plot other words 
on the axes because the matching predicate could 
never match them anyway. Therefore, for languages 
like Chinese and Japanese, which are written with- 
out spaces between words, tokenization boils down 
to string matching. In this manner, SIMR circum- 
vents the difficult problem of word identification in 
these languages. 

4.3 Step 3: Re-optimize Parameters 

The last step in the porting process is to re-optimize 
SIMR's numerical parameters. The four parameters 
described in Section 3 interact in complicated ways, 
and it is impossible to find a good parameter set 
analytically. It is easier to optimize these parameters 
empirically, using simulated annealing (Vidal, 1993). 

Simulated annealing requires an objective func- 
tion to optimize. The objective function for bitext 
mapping should measure the difference between the 
TBM and maps produced with the current parame- 
ter set. In geometric terms, the difference is a dis- 
tance. The TBM consists of a set of TPCs. The 
error between a bitext map and each TPC can be 
defined as the horizontal distance, the vertical dis- 
tance, or the distance perpendicular to the main di- 
agonal. The first two alternatives would minimize 
the error with respect to only one language or the 
other. The perpendicular distance is a more robust 
average. In order to penalize large errors more heav- 
ily, root mean squared (RMS) distance is minimized 
instead of mean distance. 

3 Multi-word expressions in the translation lexicon are 
treated just like any other character string. 



The most tedious part of the porting process is the 
construction of TBMs against which SIMR's param- 
eters can be optimized and tested. The easiest way 
to construct these gold standards is to extract them 
from pairs of hand-aligned text segments: The final 
character positions of each segment in an aligned 
pair are the co-ordinates of a TPC. Over the course 
of two porting efforts, I have developed and refined 
tools and methods that allow a bilingual annota- 
tor to construct the required TBMs very efficiently 
from a raw bitext. For example, a tool originally de- 
signed for automatic detection of omissions in trans- 
lations (Melamed, 1996b) was adopted to detect mis- 
alignments. 

4.4 Porting Experience Summary 

Table I summarizes the amount of time invested 
in each new language pair. The estimated times 
for building axis generators do not include the time 
spent to build the English axis generator, which was 
part of the original implementation. Axis generators 
need to be built only once per language, rather than 
once per language pair. 

5 Evaluation 

SIMR was evaluated on hand-aligned bitexts of vari- 
ous genres in three language pairs. None of these test 
bitexts were used anywhere in the training or port- 
ing procedures. Each test bitext was converted to a 
set of TPCs by noting the pair of character positions 
at the end of each aligned pair of text segments. The 
test metric was the root mean squared distance, in 
characters, between each TPC and the interpolated 
bitext map produced by SIMR, where the distance 
was measured perpendicular to the main diagonal. 
The results are presented in Table 2. 

The French/English part of the evaluation was 
performed on bitexts from the publicly available 
BAF corpus created at CITI (Simard & Plamon- 
don, 1996). SIMR's error distribution on the "parlia- 
mentary debates" bitext in this collection is given in 
Table 3. This distribution can be compared to error 
distributions reported in (Church, 1993) and in (Da- 
gan et al., 1993). SIMR's RMS error on this bitext 
was 5.7 characters. Church's char_align algorithm 
(Church, 1993) is the only algorithm that does not 
use sentence boundary information for which com- 
parable results have been reported. char_align's 
RMS error on this bitext was 57 characters, exactly 
ten times higher. 

Two teams of researchers have reported results 
on the same "parliamentary debates" bitext for al- 
gorithms that map correspondence at the sentence 
level (Gale & Church, 1991a; Simard et al., 1992). 



Table 1: Time spent in constructing two "gold standard" TBMs. 



language pair 


main informant for 
matching predicate 


estimated time 
spent to build 
new axis generator 


estimated time 

spent on 
hand-alignment 


number of 
segments 
aligned 


Spanish/English 
Korean/English 


lexical cognates 
translation lexicon 


8 h 
6 h 


5 h 
12 h 


1338 
1224 



Table 2: SIMR accuracy on different text genres in three language pairs. 



language 


number of 




number of 


RMS Error 


pair 


training TPCs 


genre 


test TPCs 


in characters 


French / English 


598 


parliamentary debates 


7123 


5.7 






CITI technical reports 


365, 305, 176 


4.4, 2.6, 9.9 






other technical reports 


561, 1393 


20.6, 14.2 






court transcripts 


1377 


3.9 






U.N. annual report 


2049 


12.36 






I.L.O. report 


7129 


6.42 


Spanish / English 


562 


software manuals 


376, 151, 100, 349 


4.7, 1.3, 6.6, 4.9 


Korean / English 


615 


military manuals 


40, 88, 186, 299 


2.6, 7.1, 25, 7.8 






military messages 


192 


0.53 



Table 3: SIMR's error distribution on the 
French/ English "parliamentary debates" bitext. 



number of 


error range 


fraction of 


test points 


in characters 


test points 


1 


-101 


.0001 


2 


-80 to -70 


.0003 


1 


-70 to -60 


.0001 


5 


-60 to -50 


.0007 


4 


-50 to -40 


.0006 


6 


-40 to -30 


.0008 


9 


-30 to -20 


.0013 


29 


-20 to -10 


.0041 


3057 


-10 to 


.4292 


3902 


to 10 


.5478 


43 


10 to 20 


.0060 


28 


20 to 30 


.0039 


17 


30 to 40 


.0024 


5 


40 to 50 


.0007 


8 


50 to 60 


.0011 


1 


60 to 70 


.0001 


1 


70 to 80 


.0001 


1 


80 to 90 


.0001 


1 


90 to 100 


.0001 


1 


110 to 120 


.0001 


1 


185 


.0001 


7123 




1.000 



Both of these algorithms use sentence boundary 
information. Melamed (1996a) showed that sen- 
tence boundary information can be used to convert 
SIMR's output into sentence alignments that are 
more accurate than those obtained by either of the 
other two approaches. 

The test bitexts in the other two language pairs 
were created when SIMR was being ported to those 
languages. The Spanish/English bitexts were drawn 
from the on-line Sun MicroSystems Solaris An- 
swcrBooks. The Korean/English bitexts were pro- 
vided and hand-aligned by Young-Suk Lee of MIT's 
Lincoln Laboratories. Although it is not possible 
to compare SIMR's performance on these language 
pairs to the performance of other algorithms, Table 2 
shows that the performance on other language pairs 
is no worse than performance on French/English. 

6 Which Text Units to Map? 

Early bitext mapping algorithms focused on sen- 
tences (Kay & Roscheisen, 1993; Debili & Sam- 
mouda, 1992). Although sentence maps do not have 
sufficient resolution for some important bitext appli- 
cations (Melamed, 1996b; Macklovitch, 1995), sen- 
tences were an easy starting point, because their 
order rarely changes during translation. Therefore, 
sentence mapping algorithms need not worry about 
crossing correspondences. In 1991, two teams of re- 
searchers independently discovered that sentences 
can be accurately aligned by matching sequences 



with similar lengths (Gale & Church, 1991a; Brown 
ct al., 1991). 

Soon thereafter, Church (1993) found that bitext 
mapping at the sentence level is not an option for 
noisy bitexts found in the real world. Sentences 
are often difficult to detect, especially where punc- 
tuation is missing due to OCR errors. More im- 
portantly, bitexts often contain lists, tables, titles, 
footnotes, citations and/or mark-up codes that foil 
sentence alignment methods. Church's solution was 
to look at the smallest of text units — characters 
- and to use digital signal processing techniques 
to grapple with the much larger number of text 
units that might match between the two halves of 
a bitext. Characters match across languages only to 
the extent that they participate in cognates. Thus, 
Church's method is only applicable to language pairs 
with similar alphabets. 

The main insight of the present work is that words 
are a happy medium-sized text unit at which to map 
bitext correspondence. By situating word positions 
in a bitext space, the geometric heuristics of sen- 
tence alignment algorithms can be exploited equally 
well at the word level. The cognate heuristic of 
the character-based algorithms works better at the 
word level, because cognateness can be defined more 
precisely in terms of words, e.g. using the Longest 
Common Subsequence Ratio (Melamed, 1995). Sev- 
eral other matching heuristics can only be applied 
at the word level, including the localized noise filter 
in Section 3.3, lists of stop words and lists of faux 
amis (Macklovitch, 1995). Most importantly, trans- 
lation lexicons can only be used at the word level. 
SIMR can employ a small hand-constructed transla- 
tion lexicon to map bitexts in any pair of languages, 
even when the cognate heuristic is not applicable and 
sentences cannot be found. The particular combina- 
tion of heuristics described in Section 3 can certainly 
be improved on, but research into better bitext map- 
ping algorithms is likely to be most fruitfull at the 
word level. 

7 Conclusion 

The Smooth Injcctive Map Recognizer (SIMR) 
bitext mapping algorithm advances the state of the 
art on several frontiers. It is significantly more ac- 
curate than other algorithms in the literature. Its 
expected running time and memory requirements 
are linear in the size of the input, which makes 
it the algorithm of choice for very large bitexts. 
It is not fazed by word order differences. It does 
not rely on pre-segmented input and is portable to 
any pair of languages with a minimal effort. These 



features make SIMR the mostly widely applicable 
bitext mapping algorithm to date. 

SIMR opens up several new avenues of research. 
One important application of bitext maps is the con- 
struction of translation lexicons (Dagan et al., 1993) 
and, as discussed, translation lexicons are an impor- 
tant information source for bitext mapping. It is 
likely that the accuracy of both kinds of algorithms 
can be improved by alternating between the two on 
the same bitext. There are also plans to build an 
automatic bitext locating spider for the World Wide 
Web, so that SIMR can be applied to more new lan- 
guage pairs and bitext genres. 
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